Identification and Translation of Significant Patterns for Cross-Domain SMT Applications
Abstract
Adapting statistical machine translation (SMT) systems from generic to specific domains is challenging because of the lack of training data. In this paper we propose a framework for domain adaptation that exploits a large monolingual in-domain corpus. We identify significant patterns that capture domain-specific writing styles. The patterns are then translated with the involvement of domain experts. The main concern of our framework is to reduce the cost of expert involvement and to allocate the experts' effort more effectively. Experimental results show that the proposed methods are effective in terms of the significance and diversity of the patterns. Approaches for integrating the mined patterns into a background SMT system are also discussed.
Similar resources
A Simplification-Translation-Restoration Framework for Cross-Domain SMT Applications
Integration of domain specific knowledge into a general purpose statistical machine translation (SMT) system poses challenges due to insufficient bilingual corpora. In this paper we propose a simplification-translation-restoration (STR) framework for domain adaptation in SMT by simplifying domain specific segments of a text. For an in-domain text, we identify the critical segments and modify th...
A Hybrid Machine Translation System Based on a Monotone Decoder
In this paper, a hybrid Machine Translation (MT) system is proposed by combining the result of a rule-based machine translation (RBMT) system with a statistical approach. The RBMT uses a set of linguistic rules for translation, which leads to better translation results in terms of word ordering and syntactic structure. On the other hand, SMT works better in lexical choice. Therefore, in our sys...
Dynamically Integrating Cross-Domain Translation Memory into Phrase-Based Machine Translation during Decoding
Our previous work focuses on combining translation memory (TM) and statistical machine translation (SMT) when the TM database and the SMT training set are the same. However, the TM database will deviate from the SMT training set in the real task when time goes by. In this work, we concentrate on the task when the TM database and the SMT training set are different and even from different domains...
Cross-Domain and Cross-Language Porting of Shallow Parsing
English was the main focus of attention of the Natural Language Processing (NLP) community for years. As a result, there are significantly more annotated linguistic resources in English than in any other language. Consequently, data-driven tools for automatic text or speech processing are developed mainly for English. Developing similar corpora and tools for other languages is an important issu...
Building Compact Lexicons for Cross-Domain SMT by Mining Near-Optimal Pattern Sets
Statistical machine translation models are known to benefit from the availability of a domain bilingual lexicon. Bilingual lexicons are traditionally comprised of multiword expressions, either extracted from parallel corpora or manually curated. We claim that “patterns”, comprised of words and higher order categories, generalize better in capturing the syntax and semantics of the domain. In thi...
Publication date: 2011